GVPT Maths Camp

Data Wrangling and Github

Learning objectives

  1. Import your data

  2. Clean your data

  3. Explore relational data

  4. Start to manage your scripts with Github

A familiar problem

A solution

  • Track changes to your documents, code, or data over time

  • Work from one document

  • Have access to your work from anywhere

  • Create safe points in case something breaks or you want to experiment

Git and Github

Open source version control software.


Think R.

A website that allows you to store your Git repositories online and makes it easy to collaborate with others.


Think RStudio.

Why should I use Git and Github? 🤔

  • More reproducible, transparent research

  • Better version control

  • Easy collaboration with others

The basics

Four verbs you need to know to use Git for version control:

  1. add

  2. commit

  3. push

  4. pull

Using Git in RStudio

Three different options:

  1. RStudio GUI

  2. Shell/terminal

  3. Github desktop1

Repositories

  • A repository is like a folder for your project, but better!

  • Organises your work

  • Displays useful information, including a general description, navigation, changes

  • A great tool for project-oriented workflows

Starting a new project: create a repository

Starting a new project: create a repository

Starting a new project: create a repository

Sync your online repository with RStudio: from scratch

Sync your online repository with RStudio: from scratch

Sync your online repository with RStudio: from scratch

Sync your online repository with RStudio: existing R project

  • We already have R projects that we started yesterday.

  • We can sync the existing R project with our new repository.

Sync your online repository with RStudio: existing R project

The usethis R package is a brilliant helper package.


RStudio GUI: Workflow

pull any changes made and stored in your Github repository before making your changes

add those changes to your staging area

commit your changes with a meaningful message

push those committed changes up to Github

Shell/terminal: Workflow

pull any changes made and stored in your Github repository before making your changes

add those changes to your staging area

commit your changes with a meaningful message

push those committed changes up to Github

Working with others

Github is like Google Docs for your code.

EXERCISE

  1. Create a new Github repository for this camp.

  2. Sync your existing R project to this new repository.

  3. add your scripts from yesterday and today.

  4. Write a helpful commit message for your future self.

  5. push your work up to Github.

  6. Add me as a collaborator: @hgoers.

Data wrangling

Source: R4DS

World Bank GDP data

https://data.worldbank.org/indicator/NY.GDP.MKTP.CD

  • Official source of global and historical GDP data

  • A very common control variable for IR and CP analysis

  • Very frustratingly messy!

Download global GDP data

Source: World Bank

Working with external data

Introducing here::here()

install.packages("here")


Points to where you are on your computer. Updates for everyone on any computer!

here::here()
[1] "/Users/harrietgoers/Documents/intro_to_r_ps"


EXERCISE: See where you are!

Reading in your csv

  • I like to store raw data in a folder called data-raw within my project.

  • I then store any clean data that is ready for analysis in a data folder within my project.

You will start to see this structure throughout my scripts. You should come up with something that works for you. It should be intuitive. This will help future you and any other people who want to explore your work navigate your projects.

Reading in your csv

library(tidyverse)

gdp_raw <- read_csv(here::here("slides", "data-raw", "wb_gdp.csv"))

head(gdp_raw)
# A tibble: 6 × 3
  `Data Source`               `World Development Indicators` ...3               
  <chr>                       <chr>                          <chr>              
1 Last Updated Date           2023-03-01                      <NA>              
2 Country Name                Country Code                   "Indicator Name,In…
3 Aruba                       ABW                            "GDP (current US$)…
4 Africa Eastern and Southern AFE                            "GDP (current US$)…
5 Afghanistan                 AFG                            "GDP (current US$)…
6 Africa Western and Central  AFW                            "GDP (current US$)…


EXERCISES

  1. Read the ?read_csv help file. What arguments does this function take?

  2. Head to the readr package documentation and find what other file types you can read in.

Skipping non-relevant rows

Source: World Bank

Skipping non-relevant rows

gdp_raw <- read_csv(
  here::here("slides", "data-raw", "wb_gdp.csv"), skip = 4, col_select = 1:66
)

head(gdp_raw)
# A tibble: 6 × 66
  `Country Name`       `Country Code` `Indicator Name` `Indicator Code`   `1960`
  <chr>                <chr>          <chr>            <chr>               <dbl>
1 Aruba                ABW            GDP (current US… NY.GDP.MKTP.CD   NA      
2 Africa Eastern and … AFE            GDP (current US… NY.GDP.MKTP.CD    2.13e10
3 Afghanistan          AFG            GDP (current US… NY.GDP.MKTP.CD    5.38e 8
4 Africa Western and … AFW            GDP (current US… NY.GDP.MKTP.CD    1.04e10
5 Angola               AGO            GDP (current US… NY.GDP.MKTP.CD   NA      
6 Albania              ALB            GDP (current US… NY.GDP.MKTP.CD   NA      
# ℹ 61 more variables: `1961` <dbl>, `1962` <dbl>, `1963` <dbl>, `1964` <dbl>,
#   `1965` <dbl>, `1966` <dbl>, `1967` <dbl>, `1968` <dbl>, `1969` <dbl>,
#   `1970` <dbl>, `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>,
#   `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>,
#   `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>,
#   `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>,
#   `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, …

EXERCISES

Taking a look at this data set:

skimr::skim(gdp_raw)


What types of data do we have? Are they the right type of data?


Are we missing data points?

Tidy Data Structures

Source: R4DS

Tidying World Bank data

What do you want to do with your data?

  • I want to analyse country, regional, and global trends in GDP over time

What I need:

  • Annual data on each country’s GDP

  • The region to which each country belongs

Tidying World Bank data

To do:

  • Move the yearly data from columns to rows

  • Clean up these column names so that they are easier to use in R

  • Add regional data

Pivoting your data

gdp_df <- pivot_longer(
  data = gdp_raw, 
  cols = `1960`:`2021`,
  names_to = "year",
  values_to = "gdp"
)


colnames(gdp_raw)
 [1] "Country Name"   "Country Code"   "Indicator Name" "Indicator Code"
 [5] "1960"           "1961"           "1962"           "1963"          
 [9] "1964"           "1965"           "1966"           "1967"          
[13] "1968"           "1969"           "1970"           "1971"          
[17] "1972"           "1973"           "1974"           "1975"          
[21] "1976"           "1977"           "1978"           "1979"          
[25] "1980"           "1981"           "1982"           "1983"          
[29] "1984"           "1985"           "1986"           "1987"          
[33] "1988"           "1989"           "1990"           "1991"          
[37] "1992"           "1993"           "1994"           "1995"          
[41] "1996"           "1997"           "1998"           "1999"          
[45] "2000"           "2001"           "2002"           "2003"          
[49] "2004"           "2005"           "2006"           "2007"          
[53] "2008"           "2009"           "2010"           "2011"          
[57] "2012"           "2013"           "2014"           "2015"          
[61] "2016"           "2017"           "2018"           "2019"          
[65] "2020"           "2021"          
colnames(gdp_df)
[1] "Country Name"   "Country Code"   "Indicator Name" "Indicator Code"
[5] "year"           "gdp"           

Clean column names

Column names should not:

  • Have spaces

  • Start with numbers

Introducing janitor:

# Install the `janitor` package

install.packages("janitor")


gdp_df <- janitor::clean_names(gdp_df)

colnames(gdp_df)
[1] "country_name"   "country_code"   "indicator_name" "indicator_code"
[5] "year"           "gdp"           

Add region data

Introducing the countrycode package - the indispensable workhorse of country data:

# Install the `countrycode` package

install.packages("countrycode")

Add region data

library(countrycode)

# Add each country's World Bank region to the data set

gdp_df <- gdp_df |> 
  mutate(
    region = countrycode(country_name, 
                         "country.name", 
                         "region", 
                         custom_match = c("Turkiye" = "Europe & Central Asia"))
  ) |> 
  # Remove observations that are regions
  drop_na(region) |> 
  relocate(region, .after = "country_code")

Add region data

gdp_df |> 
  distinct(country_name, region) |> 
  head(10)
# A tibble: 10 × 2
   country_name         region                    
   <chr>                <chr>                     
 1 Aruba                Latin America & Caribbean 
 2 Afghanistan          South Asia                
 3 Angola               Sub-Saharan Africa        
 4 Albania              Europe & Central Asia     
 5 Andorra              Europe & Central Asia     
 6 United Arab Emirates Middle East & North Africa
 7 Argentina            Latin America & Caribbean 
 8 Armenia              Europe & Central Asia     
 9 American Samoa       East Asia & Pacific       
10 Antigua and Barbuda  Latin America & Caribbean 

Make sure all data are the right type

glimpse(gdp_df)
Rows: 13,454
Columns: 7
$ country_name   <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "…
$ country_code   <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW",…
$ region         <chr> "Latin America & Caribbean", "Latin America & Caribbean…
$ indicator_name <chr> "GDP (current US$)", "GDP (current US$)", "GDP (current…
$ indicator_code <chr> "NY.GDP.MKTP.CD", "NY.GDP.MKTP.CD", "NY.GDP.MKTP.CD", "…
$ year           <chr> "1960", "1961", "1962", "1963", "1964", "1965", "1966",…
$ gdp            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

Make sure all data are the right type

gdp_df <- transmute(
  gdp_df,
  country_name, 
  region, 
  year = as.integer(year),
  gdp
)

glimpse(gdp_df)
Rows: 13,454
Columns: 4
$ country_name <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Ar…
$ region       <chr> "Latin America & Caribbean", "Latin America & Caribbean",…
$ year         <int> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 196…
$ gdp          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

How do countries compare over time?

ggplot(gdp_df, aes(
  x = year, y = gdp, colour = region, group = country_name
)) + 
  geom_line() + 
  theme_minimal()

How do regions compare over time?

By default, R will carry forward NAs. This is good!

gdp_df |> 
  group_by(region, year) |> 
  summarise(avg_gdp = mean(gdp)) |> 
  ggplot(aes(x = year, y = avg_gdp, colour = region)) + 
  geom_line() + 
  theme_minimal()

Dealing with missing data

gdp_df |> 
  group_by(region, year) |> 
  summarise(avg_gdp = mean(gdp, na.rm = T)) |> 
  ggplot(aes(x = year, y = avg_gdp, colour = region)) + 
  geom_line() + 
  theme_minimal()

Dealing with missing data

Summary

Today you:

  • Learnt how to read in external data

  • Learnt how to clean up common problems using R